Implement zero-copy tokenization for identifiers, strings, and comments by eyalleshem · Pull Request #2136 · apache/datafusion-sqlparser-rs

eyalleshem · 2025-12-18T20:49:58Z

This PR implements zero-copy tokenization by using borrowed strings (&str) instead of owned strings (String) for identifiers, string literals, and comments. This eliminates unnecessary string allocations during the tokenization process.
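
As a rough illustration of the idea (not the code in this PR), the sketch below tokenizes a string into variants that hold slices of the input rather than freshly allocated `String`s. The names `SketchToken`, `scan_end`, and `sketch_tokenize` are hypothetical and exist only for this example; the real tokenizer handles far more token kinds.

```rust
/// Hypothetical token type for illustration only; not the crate's `Token`.
#[derive(Debug, PartialEq)]
pub enum SketchToken<'a> {
    /// An identifier or keyword, borrowed from the input.
    Word(&'a str),
    /// A run of whitespace, borrowed from the input.
    Whitespace(&'a str),
    /// Any other single character.
    Char(char),
}

/// Return the byte offset where the run of characters matching `pred`,
/// starting at `start`, ends.
fn scan_end(src: &str, start: usize, pred: impl Fn(char) -> bool) -> usize {
    src[start..]
        .char_indices()
        .find(|&(_, ch)| !pred(ch))
        .map_or(src.len(), |(i, _)| start + i)
}

/// Split `src` into tokens whose string payloads are slices of `src`.
/// Only the output `Vec` is allocated; no token text is copied.
pub fn sketch_tokenize<'a>(src: &'a str) -> Vec<SketchToken<'a>> {
    let mut tokens = Vec::new();
    let mut pos = 0;
    while let Some(c) = src[pos..].chars().next() {
        if c.is_alphabetic() || c == '_' {
            let end = scan_end(src, pos, |ch| ch.is_alphanumeric() || ch == '_');
            tokens.push(SketchToken::Word(&src[pos..end]));
            pos = end;
        } else if c.is_whitespace() {
            let end = scan_end(src, pos, char::is_whitespace);
            tokens.push(SketchToken::Whitespace(&src[pos..end]));
            pos = end;
        } else {
            tokens.push(SketchToken::Char(c));
            pos += c.len_utf8();
        }
    }
    tokens
}
```

Because each `Word` and `Whitespace` payload points into `src`, the tokens cannot outlive the input string. That is the trade-off the lifetime parameter makes explicit.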

Changes

- Modified Token variants to store &'a str instead of String for:
  - Word tokens (identifiers like table/column names)
  - SingleQuotedString literals
  - Whitespace
  - Comments (single-line and multi-line)
- Implemented case-insensitive keyword lookup without to_uppercase() allocation
- Added tokenize_bench criterion benchmark for performance measurement
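
One way to look up keywords case-insensitively without allocating an uppercase copy is to binary-search a sorted uppercase table with a comparator that uppercases the candidate byte by byte. The sketch below assumes ASCII keywords; the table `SKETCH_KEYWORDS` and the functions `cmp_ignore_ascii_case` and `lookup_keyword` are hypothetical names for this example and may not match what the PR does.

```rust
use std::cmp::Ordering;

/// Hypothetical, tiny keyword table. It must be uppercase and sorted
/// in ASCII order for the binary search to be valid.
const SKETCH_KEYWORDS: &[&str] = &["FROM", "JOIN", "SELECT", "WHERE"];

/// Compare `keyword` (already uppercase) against `word`, uppercasing
/// the bytes of `word` lazily instead of building a new `String`.
fn cmp_ignore_ascii_case(keyword: &str, word: &str) -> Ordering {
    keyword
        .bytes()
        .cmp(word.bytes().map(|b| b.to_ascii_uppercase()))
}

/// Return the index of `word` in the keyword table, ignoring ASCII case.
pub fn lookup_keyword(word: &str) -> Option<usize> {
    SKETCH_KEYWORDS
        .binary_search_by(|kw| cmp_ignore_ascii_case(kw, word))
        .ok()
}
```

The comparison works on iterators over bytes, so no intermediate string is created for either a hit or a miss.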

Performance Impact

Benchmark results using a complex 27KB SQL query with CTEs, joins, window functions, and extensive comments:

tokenization/tokenize_complex_sql
time: [254.68 µs 254.81 µs 254.97 µs]
change: [−60.885% −60.682% −60.482%] (p = 0.00 < 0.05)
Performance has improved.

This change introduces a lifetime parameter 'a to BorrowedToken enum to prepare for zero-copy tokenization support. This is a foundational step toward reducing memory allocations during SQL parsing.

Changes:
- Added lifetime parameter to BorrowedToken<'a> enum
- Added _Phantom(Cow<'a, str>) variant to carry the lifetime
- Implemented Visit and VisitMut traits for Cow<'a, str> to support the visitor pattern with the new lifetime parameter
- Fixed lifetime issues in visitor tests by using tokenized_owned() instead of tokenize() where owned tokens are required
- Type alias Token = BorrowedToken<'static> maintains backward compatibility
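
The pattern described in this commit can be sketched as follows. Rust rejects an enum that declares a lifetime parameter it never uses, so a placeholder variant holding `Cow<'a, str>` keeps `'a` in use until real variants borrow from the source. All names here (`SketchBorrowedToken`, `SketchToken`, `into_static`) are hypothetical stand-ins, and the variant set is reduced to a minimum.

```rust
use std::borrow::Cow;

/// Hypothetical stand-in for an enum that has gained a lifetime parameter.
#[derive(Debug, Clone, PartialEq)]
pub enum SketchBorrowedToken<'a> {
    Comma,
    Number(String),
    /// Placeholder whose only job is to mention `'a`, so the enum
    /// compiles before any variant actually borrows from the input.
    _Phantom(Cow<'a, str>),
}

/// Alias that keeps the old, lifetime-free name usable by existing callers.
pub type SketchToken = SketchBorrowedToken<'static>;

impl<'a> SketchBorrowedToken<'a> {
    /// Convert a possibly-borrowing token into one that owns its data.
    pub fn into_static(self) -> SketchToken {
        match self {
            Self::Comma => SketchBorrowedToken::Comma,
            Self::Number(n) => SketchBorrowedToken::Number(n),
            Self::_Phantom(s) => {
                SketchBorrowedToken::_Phantom(Cow::Owned(s.into_owned()))
            }
        }
    }
}
```

With the alias in place, code written against the old name keeps compiling, while new code can opt into the borrowed form.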

…hitespace

Convert token string fields to use Cow<'a, str> to enable zero-copy tokenization for commonly used tokens:
- Word.value: Regular identifiers and keywords now borrow from source
- SingleQuotedString: String literals borrow when no escape processing needed
- Whitespace: Single-line and multi-line comments borrow from source

Also add benchmark for measuring tokenization performance
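
The "borrow when no escape processing needed" behaviour is what `Cow` is designed for. A minimal sketch, assuming the only escape is the SQL-style doubled quote `''` and that the caller passes the text between the quotes; `unescape_single_quoted` is a hypothetical helper, not a function from the crate:

```rust
use std::borrow::Cow;

/// Return the value of a single-quoted literal given the text between
/// the quotes. Borrows the input when nothing needs unescaping and
/// allocates only when a doubled quote has to be collapsed.
pub fn unescape_single_quoted(body: &str) -> Cow<'_, str> {
    if body.contains("''") {
        Cow::Owned(body.replace("''", "'"))
    } else {
        Cow::Borrowed(body)
    }
}
```

Most literals in typical queries contain no escapes, so they take the borrowed path and cost no allocation.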

eyalleshem force-pushed the tokenize_with_borrow_1 branch 4 times, most recently from 6773837 to 1c80b40 Compare December 21, 2025 12:57

eyalleshem mentioned this pull request Dec 21, 2025

Improve performance by reducing string copies #2036

Open

eyalleshem force-pushed the tokenize_with_borrow_1 branch from 1c80b40 to 5458a2b Compare December 21, 2025 14:39

eyalleshem wants to merge 2 commits into apache:reduce-string-copying from eyalleshem:tokenize_with_borrow_1